
    Distributed Data Summarization in Well-Connected Networks

    We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes, each of which may hold a value initially, we focus on computing $\sum_{i=1}^N g(f_i)$, where $f_i$ is the number of occurrences of value $i$ and $g$ is some fixed function. This includes important statistics such as the number of distinct elements, frequency moments, and the empirical entropy of the data. In the CONGEST model, a simple adaptation from streaming lower bounds shows that computing some of these statistics exactly requires $\tilde{\Omega}(D + \sqrt{n})$ rounds, where $D$ is the diameter of the graph. However, these lower bounds do not hold for graphs that are well-connected. We give an algorithm that computes $\sum_{i=1}^{N} g(f_i)$ exactly in $\tau_G \cdot 2^{O(\sqrt{\log n})}$ rounds, where $\tau_G$ is the mixing time of $G$. This also has applications in computing the top $k$ most frequent elements. We demonstrate that there is a high similarity between the GOSSIP model and the CONGEST model in well-connected graphs. In particular, we show that each round of the GOSSIP model can be simulated almost perfectly in $\tilde{O}(\tau_G)$ rounds of the CONGEST model. To this end, we develop a new algorithm for the GOSSIP model that $(1 \pm \epsilon)$-approximates the $p$-th frequency moment $F_p = \sum_{i=1}^N f_i^p$ in $\tilde{O}(\epsilon^{-2} n^{1-k/p})$ rounds for $p \geq 2$, when the number of distinct elements $F_0$ is at most $O(n^{1/(k-1)})$. This result can be translated back to the CONGEST model with a factor $\tilde{O}(\tau_G)$ blow-up in the number of rounds.
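    To make the target quantity concrete: with $g(f) = 1$ the sum counts distinct elements ($F_0$), with $g(f) = f^p$ it is the $p$-th frequency moment, and with $g(f) = (f/m)\log(m/f)$ (where $m$ is the stream length) it is the empirical entropy. The sketch below (the helper name `summarize` and the toy data are illustrative) computes these statistics centrally from raw frequencies; it only illustrates the definitions, not the distributed CONGEST/GOSSIP algorithms of the paper.

        from collections import Counter
        from math import log

        def summarize(values, g):
            # sum_i g(f_i), where f_i is the number of occurrences of value i
            freq = Counter(values)
            return sum(g(f) for f in freq.values())

        data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
        m = len(data)

        distinct = summarize(data, lambda f: 1)                     # F_0
        f2       = summarize(data, lambda f: f * f)                 # F_2
        entropy  = summarize(data, lambda f: (f / m) * log(m / f))  # empirical entropy

        print(distinct, f2, entropy)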

    Finding Subcube Heavy Hitters in Analytics Data Streams

    Data streams typically have items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{ T_1,\cdots,T_k \} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, where $d=1$, was heavily studied in data stream theory, databases, networking and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir-sampling-based one-pass streaming algorithm that solves the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored. Comment: To appear in WWW 201
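    The abstract describes its first algorithm only as "reservoir sampling based"; the sketch below is one plausible minimal instantiation of that idea, not the paper's exact algorithm: keep a uniform reservoir of roughly $c/\gamma$ items and answer ${\rm Query}(T,v)$ by thresholding the empirical fraction of sampled items whose coordinates $T$ equal $v$. The class name, the sample-size constant c, and the gamma/2 decision threshold are assumptions made for illustration.

        import random

        class SubcubeHHSketch:
            # One-pass uniform reservoir sample over d-dimensional stream items (tuples).
            def __init__(self, gamma, c=50, seed=0):
                self.gamma = gamma
                self.size = max(1, int(c / gamma))   # reservoir size, an assumed parameter
                self.reservoir = []
                self.count = 0
                self.rng = random.Random(seed)

            def update(self, item):
                # Standard reservoir sampling (Algorithm R).
                self.count += 1
                if len(self.reservoir) < self.size:
                    self.reservoir.append(item)
                else:
                    j = self.rng.randrange(self.count)
                    if j < self.size:
                        self.reservoir[j] = item

            def query(self, T, v):
                # Estimate f_T(v) as the fraction of sampled items matching v on coordinates T.
                if not self.reservoir:
                    return False
                hits = sum(1 for x in self.reservoir
                           if all(x[t] == vj for t, vj in zip(T, v)))
                return hits / len(self.reservoir) >= self.gamma / 2

        # Example: items are 3-dimensional; ask whether coordinates (0, 2) take values (1, 7).
        sketch = SubcubeHHSketch(gamma=0.1)
        for item in [(1, 5, 7), (1, 6, 7), (2, 5, 8), (1, 4, 7)]:
            sketch.update(item)
        print(sketch.query((0, 2), (1, 7)))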

    Better Streaming Algorithms for the Maximum Coverage Problem

    We study the classic NP-hard problem of finding the maximum k-set coverage in the data stream model: given a set system of m sets that are subsets of a universe {1,...,n}, find the k sets that cover the largest number of distinct elements. The problem can be approximated up to a factor of 1-1/e in polynomial time. In the streaming-set model, the sets and their elements are revealed online. The main goal of our work is to design algorithms, with approximation guarantees as close as possible to 1-1/e, that use sublinear space o(mn). Our main results are: 1) Two (1-1/e-epsilon) approximation algorithms: one uses O(1/epsilon) passes and O(k/epsilon^2 polylog(m,n)) space, whereas the other uses only a single pass but O(m/epsilon^2 polylog(m,n)) space. 2) We show that any approximation factor better than (1-(1-1/k)^k) in a constant number of passes requires space that is linear in m for constant k, even if the algorithm is allowed unbounded processing time. We also demonstrate a single-pass, (1-epsilon) approximation algorithm using O(m/epsilon^2 min(k,1/epsilon) polylog(m,n)) space. We also study the maximum k-vertex coverage problem in the dynamic graph stream model. In this model, the stream consists of edge insertions and deletions of a graph on N vertices. The goal is to find k vertices that cover the largest number of distinct edges. We show that any constant-factor approximation in a constant number of passes requires space that is linear in N for constant k, whereas O(N/epsilon^2 polylog(m,n)) space is sufficient for a (1-epsilon) approximation and arbitrary k in a single pass. For regular graphs, we show that O(k/epsilon^3 polylog(m,n)) space is sufficient for a (1-epsilon) approximation in a single pass. We generalize this to a (K-epsilon) approximation when the ratio between the minimum and maximum degree is bounded below by K.
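    The 1-1/e factor cited above as the polynomial-time benchmark is achieved by the classical offline greedy algorithm, sketched below for reference; the paper's contribution is to approach this guarantee in sublinear space, which this offline sketch does not attempt. The function name and example sets are illustrative.

        def greedy_max_coverage(sets, k):
            # Offline greedy: repeatedly pick the set covering the most uncovered elements.
            # Achieves a (1 - 1/e) approximation to maximum k-set coverage.
            covered, chosen = set(), []
            for _ in range(k):
                best, best_gain = None, 0
                for i, s in enumerate(sets):
                    gain = len(s - covered)
                    if gain > best_gain:
                        best, best_gain = i, gain
                if best is None:      # nothing new can be covered
                    break
                chosen.append(best)
                covered |= sets[best]
            return chosen, covered

        sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
        print(greedy_max_coverage(sets, k=2))   # picks index 2, then index 0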

    Distributed Dense Subgraph Detection and Low Outdegree Orientation


    Maximum Coverage in the Data Stream Model: Parameterized and Generalized

    We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The input to both problems consists of $m$ subsets of a universe of size $n$ and a value $k \in [m]$. In Max-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are: If the sets have size at most $d$, there exist single-pass algorithms using $\tilde{O}(d^{d+1} k^d)$ space that solve both problems exactly. This is optimal up to polylogarithmic factors for constant $d$. If each element appears in at most $r$ sets, we present single-pass algorithms using $\tilde{O}(k^2 r/\epsilon^3)$ space that return a $1+\epsilon$ approximation in the case of Max-Cover. We also present a single-pass algorithm using slightly more memory, i.e., $\tilde{O}(k^3 r/\epsilon^{4})$ space, that $1+\epsilon$ approximates Max-Unique-Cover. In contrast to the above results, when $d$ and $r$ are arbitrary, any constant-pass $1+\epsilon$ approximation algorithm for either problem requires $\Omega(\epsilon^{-2}m)$ space, but a single-pass $O(\epsilon^{-2}mk)$ space algorithm exists. In fact, any constant-pass algorithm with an approximation better than $e/(e-1)$ and $e^{1-1/k}$ for Max-Cover and Max-Unique-Cover respectively requires $\Omega(m/k^2)$ space when $d$ and $r$ are unrestricted. En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem. Comment: Conference version to appear at ICDT 202
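    The difference between the two objectives is easy to miss; the brute-force sketch below (exponential in $k$, purely an illustration of the definitions rather than the paper's streaming algorithms) evaluates both objectives on the same collection of sets. The function name and example sets are illustrative.

        from itertools import combinations
        from collections import Counter

        def best_collections(sets, k):
            # Exhaustively try every choice of k sets.
            # Max-Cover: elements covered by at least one chosen set.
            # Max-Unique-Cover: elements covered by exactly one chosen set.
            best_cover, best_unique = (-1, None), (-1, None)
            for combo in combinations(range(len(sets)), k):
                counts = Counter(e for i in combo for e in sets[i])
                cover = len(counts)
                unique = sum(1 for c in counts.values() if c == 1)
                if cover > best_cover[0]:
                    best_cover = (cover, combo)
                if unique > best_unique[0]:
                    best_unique = (unique, combo)
            return best_cover, best_unique

        sets = [{1, 2, 3, 4, 5, 6}, {5, 6, 7, 8, 9, 10}, {1, 2, 3}]
        print(best_collections(sets, k=2))
        # Max-Cover picks sets 0 and 1 (10 elements covered, only 8 uniquely);
        # Max-Unique-Cover picks sets 1 and 2 (9 elements, each covered exactly once).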

    On the Locality of Nash-Williams Forest Decomposition and Star-Forest Decomposition

    Given a graph $G=(V,E)$ with arboricity $\alpha$, we study the problem of decomposing the edges of $G$ into $(1+\epsilon)\alpha$ disjoint forests in the distributed LOCAL model. Barenboim and Elkin [PODC '08] gave a LOCAL algorithm that computes a $(2+\epsilon)\alpha$-forest decomposition using $O(\frac{\log n}{\epsilon})$ rounds. Ghaffari and Su [SODA '17] made further progress by computing a $(1+\epsilon)\alpha$-forest decomposition in $O(\frac{\log^3 n}{\epsilon^4})$ rounds when $\epsilon \alpha = \Omega(\sqrt{\alpha \log n})$, i.e. the limit of their algorithm is an $(\alpha + \Omega(\sqrt{\alpha \log n}))$-forest decomposition. This algorithm, based on a combinatorial construction of Alon, McDiarmid \& Reed [Combinatorica '92], in fact provides a decomposition of the graph into \emph{star-forests}, i.e. each forest is a collection of stars. Our main result in this paper is to reduce the threshold of $\epsilon \alpha$ in $(1+\epsilon)\alpha$-forest decomposition and star-forest decomposition. This further answers the $10^{\text{th}}$ open question from Barenboim and Elkin's "Distributed Graph Algorithms" book. Moreover, it gives the first $(1+\epsilon)\alpha$-orientation algorithms with \emph{linear dependencies} on $\epsilon^{-1}$. At a high level, our results for forest decomposition are based on a combination of network decomposition, load balancing, and a new structural result on local augmenting sequences. Our result for star-forest decomposition uses a more careful probabilistic analysis for the construction of Alon, McDiarmid \& Reed; the bounds on star-arboricity here were not previously known, even non-constructively.
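    For intuition on how low-outdegree orientations yield forest decompositions (the connection behind the Barenboim-Elkin style bounds discussed above), here is a sequential, non-distributed sketch: peel vertices in order of minimum remaining degree, orient every edge from the earlier-peeled endpoint to the later one, and give each vertex's out-edges distinct labels. The orientation is acyclic, so each label class is a forest; the number of forests equals the maximum out-degree, which is at most $2\alpha - 1$ and thus weaker than the $(1+\epsilon)\alpha$ guarantees the paper targets. The function name and example graph are illustrative.

        import heapq
        from collections import defaultdict

        def forest_decomposition(adj):
            # adj maps each vertex to the set of its neighbours (undirected graph).
            # 1) Compute a degeneracy (min-degree peeling) order.
            remaining = {v: set(ns) for v, ns in adj.items()}
            heap = [(len(ns), v) for v, ns in remaining.items()]
            heapq.heapify(heap)
            order = {}                       # vertex -> position in the peeling order
            while heap:
                deg, v = heapq.heappop(heap)
                if v in order or deg != len(remaining[v]):
                    continue                 # stale heap entry
                order[v] = len(order)
                for u in remaining[v]:
                    remaining[u].discard(v)
                    heapq.heappush(heap, (len(remaining[u]), u))
            # 2) Orient each edge toward the later-peeled endpoint and label each
            #    vertex's out-edges 0, 1, 2, ...; every label class is a forest.
            forests = defaultdict(list)
            for v, ns in adj.items():
                label = 0
                for u in sorted(ns, key=order.get):
                    if order[v] < order[u]:
                        forests[label].append((v, u))
                        label += 1
            return dict(forests)

        adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
        print(forest_decomposition(adj))
        # {0: [(0, 1), (1, 2), (2, 3)], 1: [(0, 2), (1, 3)]} -- two forests.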